水母君98 - CSDN Blog

[Original] Big Data Technology Foundation: multiple-choice concept questions

Note: There may be more than one correct answer. (1) The characteristics of big data include (answer: A, B, C): A. Volume B. Variety C. Velocity D. Simplicity (2) The characteristics of cloud computing solutions include (answer: A, B, C): A. Dynamic provisioning B. Scala

2020-12-14 20:28:52 584

[Original] RDD Usage and Examples (14): Differences Between Closures and Accumulators, with Examples

1. RDD characteristics: (1) persistent; (2) lazy transformation. 2. Cluster mode: only one master/worker can run on the same machine, but a machine can be both a master and a worker. 3. Where to run: most code runs on the driver; transformations run on executors; actions - executors an.

2020-12-13 19:52:25 241

[Original] RDD Usage and Examples (13): Spark vs. MapReduce

Hadoop MapReduce: disk only, MapReduce operations only, batch model, Java only. Spark: memory or disk, many operations, BIS (batch, interactive, streaming), multiple languages.

2020-12-13 19:05:31 438

[Original] Spark Big Data Fundamentals Review: Some Concepts

1. Definition of big data: “Big Data” is data whose scale, complexity, and speed require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… 2. Characteristics of big data: Characteristics of Bi

2020-12-13 16:31:57 257 2

[Original] Image Processing: Lossless Compression / Huffman Coding Worked Problem

Note: when encoding, assign 1 to the higher-probability branch. Calculation:

2020-12-12 16:25:07 982
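
A minimal sketch of the convention noted in the Huffman coding entry above (the higher-probability branch gets 1), since the post's own worked calculation is not part of this listing. The symbols and probabilities are hypothetical, and `huffman_codes` is an illustrative helper name, not something from the post. When two branches tie in probability, the convention does not say which one gets 1, so other valid codes exist.

```python
import heapq
from itertools import count


def huffman_codes(probs):
    """Build a Huffman code for {symbol: probability}; returns {symbol: bits}."""
    if len(probs) == 1:
        # A single symbol still needs one bit.
        return {sym: "0" for sym in probs}
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_low, _, low = heapq.heappop(heap)    # least probable subtree
        p_high, _, high = heapq.heappop(heap)  # second least probable subtree
        # Lower-probability branch gets 0, higher-probability branch gets 1.
        merged = {sym: "0" + code for sym, code in low.items()}
        merged.update({sym: "1" + code for sym, code in high.items()})
        heapq.heappush(heap, (p_low + p_high, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical source symbols and probabilities, not taken from the post.
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
codes = huffman_codes(probs)
average_length = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(average_length)
```

The average code length (probability-weighted sum of code lengths) is the figure such exercises usually ask for, alongside the code table itself.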

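[Original] Image Processing: AdaBoost Worked Example

By the time I thought of it, this was the only problem I had saved... and there is no answer key either.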

2020-12-12 16:15:15 203

[Original] Chapter 6: Image Processing: Image Segmentation

Outline – Point, line and edge detection (Prewitt, Sobel and Laplacian of Gaussian) – Hough transform, e.g., straight line, circle, etc. – Thresholding (global and adaptive) – Statistical mixture model – Expectation-maximisation method – Morphological Water

2020-12-11 14:35:59 770

[Original] Chapter 4: Image Processing: Image Restoration and Filtering

Image Restoration and Filtering • Image restoration and filtering – Random noise distributions and parameter estimation – Periodic noise – Arithmetic and geometric mean filters – Harmonic and contraharmonic mean filters – Order-statistics filters • Median,

2020-12-11 01:43:50 1515 2

[Original] Chapter 3: Image Processing: Frequency Domain Image Enhancement

• Frequency domain image enhancement – Discrete Fourier transform and inverse – Notch filter – Low-pass filtering, e.g., Ideal, Butterworth and Gaussian – Image power – High-pass filtering, e.g., Ideal, Butterworth and Gaussian

2020-12-10 14:17:37 872

[Original] Chapter 2: Image Processing: Spatial Domain Image Enhancement

Outline • Spatial domain image enhancement – Intensity transformation functions • Inverse or image negatives • Brightening, e.g., log • Darkening, e.g., nth power • Power-law transformation • Intensity normalization – Histogram equalization • Contrast stre

2020-12-09 21:11:03 901

[Original] Chapter 1: Image Processing: Image Representation

Outline • Image representation – Pixel = picture element – Sampling – Quantization 1. Pixel: individual elements are called image elements, picture elements (pixels), or image points. 2. Sampling: the sampling rate on intensity values is the image resolution.

2020-12-09 20:34:42 1107

[Original] Chapter 8: Image Processing: Image Compression

Outline • Introduction • Fundamentals • Coding, interpixel and psychovisual redundancies • Fidelity criteria • Image compression model • Source encoder and decoder • Channel encoder and decoder • Lossless compression • Huffman encoding • Lossless predictiv

2020-12-04 15:33:40 2810

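[Original] PySpark error: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI

I am using Spark 3.0 with Scala 2.12. First, start pyspark from cmd. There is a small trick here: the first time, start pyspark with arguments so that it downloads all of the GraphFrames jar dependencies. Many tutorials start it without specifying the dependency package, which can cause errors. (Find the download command that matches your Spark version on the GraphFrames website.) Website link: graphframes. For example, after downloading the matching 0.8.0-spark3.0-s_2.12 package, I placed it in the directory Spark uses at startup. In the terminal, enter pyspark --packages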

2020-11-24 16:38:56 960

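[Original] DGL environment error: Setting the default backend to "pytorch", BACKEND can be chosen from mxnet, pytorch, tensor

DGL is commonly used when working with graph neural networks in Python. I forgot to take a screenshot of the full error message before fixing it, so I only remember the keywords in the title. The gist: the default backend value in a JSON file under a dgl folder has to be set to pytorch. When I opened the file, the default was already pytorch, which was confusing. On a whim, I appended the line export DGLBACKEND="pytorch" to my environment configuration file, saved it, and it worked. As a bonus, here is how to change environment variables on macOS: in the terminal, enter (remember to choose the .bash_profile suffix)...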

2020-11-01 17:46:55 7212 10

[Original] RDD Usage and Examples (12): Implementing PageRank

import re
from operator import add
def computeContribs(urls, rank):
    # Calculates URL contributions to the rank of other URLs.
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)
def parseNeighbors(urls):
    # Parses a

2020-10-26 22:41:19 254
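
For the PageRank entry above, a minimal single-machine sketch of the iteration that a function like `computeContribs` feeds into. It is plain Python rather than RDD code, and it assumes the common update rule rank = 0.15 + 0.85 × (sum of contributions); the link graph, the `pagerank` helper, and the iteration count are hypothetical. It also assumes every page has at least one outgoing link and every link target is itself a key of `links`.

```python
def compute_contribs(urls, rank):
    """Split a page's rank evenly among the pages it links to."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)


def pagerank(links, iterations=20, damping=0.85):
    """links maps each page to the list of pages it links to."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Sum the contributions each page receives from its in-neighbors.
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            for url, contrib in compute_contribs(neighbors, ranks[page]):
                contribs[url] += contrib
        # Damped update applied to every page at once.
        ranks = {page: (1 - damping) + damping * contribs[page] for page in links}
    return ranks


# Hypothetical three-page link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```

In an RDD version, the inner loops would typically become a join of links with ranks, a flatMap over the contributions, and a reduceByKey with `add`.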

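[Original] Spark SQL Study Notes (1): Basic Operations

1.1 Read the data
df = spark.read.csv('building.csv', header=True, inferSchema=True)
1.2 Show the data
# show the content of the dataframe
df.show()
1.3 Show the data types
df.printSchema()
2.1 Create an RDD from the DataFrame
dfrdd = df.rdd
dfrdd.take(3)  # show the first three rows to see what they look like...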

2020-10-26 16:03:24 197

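[Original] Machine Learning Study Notes (4): Logistic Regression

We have already covered linear regression, whose typical use case looks like this: given your height, income, attractiveness score, and family wealth, what overall score will your future partner have? We build a continuous model from discrete data points and focus on predicting a numeric value. But suppose we instead know a watermelon's color, weight, and rind pattern, and ask whether it is a good melon. The answer has to be yes or no. Logistic regression is introduced for this kind of problem: although it is called regression, it actually solves classification. But everything we fit is a line or curve, so how can it give a yes/no answer? y=1, if x>0; y=0.5 if x=0; y=0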

2020-10-05 15:40:43 85
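
The preview above is cut off just as it introduces a step-like rule (y = 1 for x > 0, y = 0.5 at x = 0). As a hedged sketch of where logistic regression usually goes from there, and not necessarily how the post continues, the sigmoid function is a smooth version of that step, and thresholding its output turns a fitted score into a yes/no answer. The function names and the 0.5 threshold are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Smooth step: maps any real score to a value in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula so exp() never overflows.
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z, threshold=0.5):
    """Turn a linear score z = w.x + b into a yes (1) / no (0) answer."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z), classify(z))
```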

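[Original] Machine Learning Study Notes (3): Regularization

Why regularize? To address overfitting. Why does overfitting happen? Partly because there are too many features x1, x2, x3, … (each x represents one feature). Suppose a model with only two features, x1 and x2, yields the nonlinear equation ↓ Write the coefficient of each term of the model as a matrix w; w0 is left out because it is a constant and can simply be written separately. The error function before regularization is: After introducing regularization: λ≥0 is a hyperparameter, chosen in advance, that controls the strength. In plain words, it puts the model on a diet. Whose weight gets cut? Terms with high powers that are not actually that important (and may even cause overfitting) get a penalty term to offset their influence. Suppose the model computed from a dataset has 9th-degree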

2020-10-02 17:05:44 145
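
The equations in the regularization entry above did not survive in this listing, so the following is only a sketch under stated assumptions: it uses mean squared error plus an L2 penalty λ·Σ wⱼ² and, as the note says, leaves the constant term out of the penalty. The post's exact penalty form and scaling may differ; `ridge_loss` and the sample data are hypothetical.

```python
def predict(w, b, x):
    """Linear model: w1*x1 + w2*x2 + ... + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def ridge_loss(w, b, xs, ys, lam):
    """Mean squared error plus an L2 penalty on w (the constant b is not penalized)."""
    n = len(xs)
    mse = sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / n
    penalty = lam * sum(wj * wj for wj in w)
    return mse + penalty


# Hypothetical data with two features per sample.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [1.0, 3.0, 4.0, 6.0]
w, b = [2.0, 3.0], 1.0  # fits the data exactly, so the error term is 0

print(ridge_loss(w, b, xs, ys, lam=0.0))  # error term only
print(ridge_loss(w, b, xs, ys, lam=0.1))  # error term plus 0.1 * (2**2 + 3**2)
```

With λ = 0 the loss reduces to the unregularized error; a larger λ makes large coefficients more expensive, which is the "weight loss" effect the note describes.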

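[Original] Machine Learning Study Notes (2): Multiple Linear Regression

Multiple linear regression. Linear regression means: given a set of data points on a plane, find a straight line through the middle of them such that the sum of the distances from each point to the line is as small as possible. With multiple linear regression, the problem is no longer the ordinary y=wx+b but y=w1x1+w2x2+…+b. Fitting inevitably incurs loss, so we define the following loss function ↓ Mean squared error: as the name suggests, it both squares the errors and averages them. Written out in full: The last step is actually taking the partial derivative. We want the minimum of this expression. As we learned in high school, to find a maximum or minimum, set the derivative to 0. So for w we take the partial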

2020-10-02 16:32:36 202
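
The multiple linear regression entry above minimizes mean squared error by setting partial derivatives to zero. Since the derivation itself is cut off in this listing, here is a small sketch that uses the same partial derivatives in a gradient descent loop instead of the closed-form solution; that substitution, the `fit` helper, the learning rate, and the data (generated from y = 2·x1 + 3·x2 + 1) are all assumptions for illustration.

```python
def fit(xs, ys, lr=0.05, steps=5000):
    """Fit y = w1*x1 + ... + wd*xd + b by gradient descent on mean squared error."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Residuals (prediction minus target) for every sample.
        errs = [sum(wj * xj for wj, xj in zip(w, x)) + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to each w_j and to b.
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errs, xs)) for j in range(d)]
        grad_b = 2.0 / n * sum(errs)
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free samples of y = 2*x1 + 3*x2 + 1.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]
w, b = fit(xs, ys)
print(w, b)
```

At the minimum every partial derivative is zero, which is the same condition the note solves analytically.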

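[Original] Machine Learning Study Notes (1): Entropy, Conditional Entropy, Cross-Entropy, KL Divergence, Mutual Information

Jensen's inequality: proved by mathematical induction. Log loss function (logarithmic function). Entropy: the base of the log is usually 2. Entropy represents the degree of uncertainty of X.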

2020-10-02 13:58:36 856 1
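
As a small illustration of the entropy definition in the entry above, H(X) = −Σ p(x) log₂ p(x), here is a sketch with base-2 logs as the note suggests. The example distributions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
uniform_four = [0.25, 0.25, 0.25, 0.25]

print(entropy(fair_coin))     # 1 bit: maximally uncertain for two outcomes
print(entropy(biased_coin))   # below 1 bit: the outcome is easier to guess
print(entropy(uniform_four))  # 2 bits
```

The more predictable X is, the lower its entropy, which matches the note's reading of entropy as a measure of uncertainty.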

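[Original] RDD Usage and Examples (11): The Relationship Between Master, Worker, Executor, and Driver in Spark, and the Difference Between Sharing via Broadcast and Creating a Global Variable Directly

The relationship between master, worker, executor, and driver in Spark (reference 1). The master is the boss, and the workers are the crew. The boss has a second-in-command called the driver, who is in charge of a batch of goods, the executors (resources such as memory and CPU). The second-in-command asks the boss how to split them among the crew, and the boss says: do it like this. Zhang San (a worker) is not very strong, so he carries two boxes (executors); Li Si (a worker) is more capable, so he carries five boxes (executors). But the job is not done once the boxes are moved: the materials inside still need repeated rework and coordination to become the finished product. So the crew has to keep communicating and keep checking in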

2020-09-28 12:14:16 398

[Original] RDD Usage and Examples (10): Implementing k-means with RDDs in Spark

import numpy as np
def parseVector(line):
    return np.array([float(x) for x in line.split()])
def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)

2020-09-28 11:46:49 487

[Original] RDD Usage and Examples (9): Join and Broadcast, Usage and Comparison

Join
products = sc.parallelize([(1, "Apple"), (2, "Orange"), (3, "TV"), (5, "Computer")])
trans = sc.parallelize([(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")), (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))])
print(trans.join(products).take(

2020-09-27 22:47:28 483

[Original] RDD Usage and Examples (8): reduceByKey, sortByKey, and sortBy

I. Dataset: fruits.txt
apple
banana
canary melon
grap
lemon
orange
pineapple
strawberry
II. Assign a value to each key and merge identical keys
Example 1
fruits = sc.textFile('/Users/huangluyu/data/fruits.txt')
numFruitsByLength = fruits.map(lambda fruit: (len(fruit), 1)).reduceByKey(lambda x, y: x + y)
print

2020-09-27 16:20:43 798
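
To show what the reduceByKey call in the entry above computes without needing a Spark cluster, here is a plain-Python emulation. `reduce_by_key` is a hypothetical stand-in, not a Spark API, and the fruit list copies the dataset as it appears in the preview.

```python
def reduce_by_key(pairs, func):
    """Merge the values of identical keys with func, like RDD.reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


fruits = ["apple", "banana", "canary melon", "grap", "lemon",
          "orange", "pineapple", "strawberry"]

# Same shape as the post: map each fruit to (length, 1), then add up the 1s per length.
pairs = ((len(fruit), 1) for fruit in fruits)
num_fruits_by_length = reduce_by_key(pairs, lambda x, y: x + y)

# Sorting by key plays the role of sortByKey here.
print(sorted(num_fruits_by_length.items()))
```

In Spark the merge happens per partition first and then across partitions, which is why the function passed to reduceByKey should be associative and commutative.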

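[Original] RDD Usage and Examples (6): Linear-Time Selection, Finding the k-th Smallest Number

Approach: 1. Take the first number as the initial pivot x and split the partition into two parts, A1 (less than x) and A2 (greater than x). 2. If A1.count + 1 equals k, the k-th smallest number is exactly this pivot, which splits the partition into [min, k-1] and [k+1, max]. 3. If A1.count + 1 is less than k, the number we want is in A2, so split A2 again. If A1.count + 1 is greater than k, the number we want is in A1, so split A1 again. 4. Repeat on the new target partition (A1 or A2): take its first number as the new x and split the partition into…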

2020-09-27 14:35:42 439
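
The steps in the selection entry above can be sketched on a plain Python list as follows. This is not the post's RDD code: `kth_smallest` is an illustrative name, the values are assumed to be distinct (the steps only split into "less than" and "greater than"), and k is 1-based. One detail the steps leave implicit is that when the search moves into A2, k must be reduced by A1.count + 1.

```python
def kth_smallest(values, k):
    """Return the k-th smallest (1-based) element of a list of distinct numbers."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    items = list(values)
    while True:
        x = items[0]                       # first element is the pivot
        a1 = [v for v in items if v < x]   # elements smaller than the pivot
        a2 = [v for v in items if v > x]   # elements larger than the pivot
        rank = len(a1) + 1                 # position of the pivot in sorted order
        if rank == k:
            return x
        if rank < k:
            items = a2
            k -= rank                      # skip the pivot and everything in A1
        else:
            items = a1


print(kth_smallest([7, 2, 9, 4, 1, 8], 3))  # 4
```

With the first element as pivot the running time is linear on average but can degrade to quadratic on already sorted input; a random pivot is a common remedy.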

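[Original] RDD Usage and Examples (7): collect and take

Sometimes collect gives strange results. Prefer take where you can, because collect does not always gather the data correctly. In the example below, collect produces a clearly incorrect result.
A = sc.parallelize(range(10))
x = 5
B = A.filter(lambda z: z < x)
# B.cache()
B.unpersist()
print(B.take(10))
print(B.collect())
x = 3
print(B.take(10))
print(B.collect())
# c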

2020-09-27 14:17:57 1280

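[Original] RDD Usage and Examples (5): Using glom

glom
1. glom merges the elements of each partition into one array.
2. glom is a transformation operator.
# Example: glom
import sys
import random
a = sc.parallelize(range(0,100),10)  # parallelize distributes the data: the numbers 0 to 99 are split into 10 partitions
print(a.collect())  # without glom, the elements are not grouped
print(a.glom().collect())
print(a.map(lambda x: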

2020-09-27 13:14:29 2518

[Original] RDD Usage and Examples (4): Computing Pi with the Monte Carlo Method and Optimizing with mapPartitionsWithIndex

# From the official spark examples.
import random
partitions = 1000
n = 1000 * partitions
def f(_):
    x = random.random()
    y = random.random()
    return 1 if x ** 2 + y ** 2 < 1 else 0
count = sc.parallelize(range(1, n + 1), partitions) \

2020-09-27 13:09:53 165
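
The Monte Carlo entry above is cut off before the mapPartitionsWithIndex part, so the following is only a guess at the idea, written in plain Python: instead of calling a function once per sample, each partition runs one function that receives its partition index, draws all of its samples, and returns a single hit count. `count_hits`, the per-partition seeding, and the sample sizes are assumptions, not the post's code.

```python
import random


def count_hits(partition_index, samples):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(partition_index)  # one independent generator per partition
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return hits


partitions = 10
samples_per_partition = 10_000

# One call per partition, mirroring how a per-partition function would be applied.
total_hits = sum(count_hits(i, samples_per_partition) for i in range(partitions))
pi_estimate = 4.0 * total_hits / (partitions * samples_per_partition)
print(pi_estimate)
```

Seeding from the partition index keeps the partitions from drawing identical random sequences, which is one practical reason to want the index inside the function.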

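[Original] RDD Usage and Examples (3): Differences Between map, mapPartitions, and mapPartitionsWithIndex

mapPartitions is a variant of map; both process partitions in parallel. The main difference is the granularity of the call: map applies its function to each element of the RDD, whereas mapPartitions applies its function to each partition. Suppose an RDD has 10 elements split into 3 partitions. With map, the function is called 10 times; with mapPartitions, it is called only 3 times, once per partition. mapPartitionsWithIndex does the same but also passes in the partition index. # Example: mapPa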

2020-09-27 12:56:28 1274
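
The call counts described in the entry above (10 calls for map, 3 for mapPartitions) can be checked with a plain-Python emulation. Nothing below is Spark API: `split_into_partitions` is a hypothetical helper that uses contiguous slices, and Spark's own partitioning may place elements differently.

```python
def split_into_partitions(data, num_partitions):
    """Cut a list into contiguous slices (an assumption about the layout)."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]


calls = {"map": 0, "mapPartitions": 0}


def square(x):
    calls["map"] += 1            # invoked once per element
    return x * x


def square_partition(iterator):
    calls["mapPartitions"] += 1  # invoked once per partition
    return [x * x for x in iterator]


partitions = split_into_partitions(list(range(10)), 3)

mapped = [[square(x) for x in part] for part in partitions]
mapped_by_partition = [square_partition(iter(part)) for part in partitions]

# The "with index" flavor also hands the partition number to the function.
sums_with_index = [(index, sum(part)) for index, part in enumerate(partitions)]

print(calls)
print(sums_with_index)
```

Both approaches produce the same squared values; mapPartitions pays off when each call has a fixed setup cost, such as opening a connection, that is better paid once per partition than once per element.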

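[Original] RDD Usage and Examples (1): Basic Introduction

## Some of this material comes from HKUST lecture notes
Nothing is actually computed until an action is executed. For example ↓
# Read data from local file system:
# sc.textFile reads the data
fruits = sc.textFile('../data/fruits.txt')
yellowThings = sc.textFile('../data/yellowthings.txt')
print(fruits.collect())
print(yellowThings.collect())
l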

2020-09-27 12:09:57 1126

[Original] RDD Usage and Examples (2): Analyzing Execution Order with Transformations, Actions, and cache

rdd = sc.parallelize(range(10))
accum = sc.accumulator(0)
def g(x):
    global accum
    accum += x
    return x * x
a = rdd.map(g)
print(accum.value)
print(a.reduce(lambda x, y: x+y))
a.cache()
tmp = a.count()
print(accum.value)
print(rdd.reduce

2020-09-27 12:07:54 187

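[Original] networkx Basics: Adding Nodes, Adding Edges, Removing Nodes, Removing Edges, Computing Degree, Assigning Weights

Methods covered: Standard installation: pip3 install networkx (if you use pip, change pip3 to pip). If you get a pile of errors, see here ↓ networkx installation tutorial. Several ways to add nodes and edges, and how to display them:
import networkx as nx
G = nx.Graph()  # create an empty graph
G.add_node(1)  # add node 1
G.add_nodes_from([2, 3])  # add nodes 2 and 3
print(G.nodes)  # print all of the nodes
G.add_edge(1,2)  # another way to add an edge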

2020-09-24 23:56:50 22173 4

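[Original] Installing networkx on macOS fails with response.py", line 360, in _error_catcher, etc.

Environment: macOS, pip3 (the same applies on Windows). A very long error was reported; the fix I found is one command: pip --default-timeout=100 install -U <package name>. For example: pip --default-timeout=100 install -U networkx. The error reported: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages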

2020-09-24 22:37:56 440

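[Original] macOS/Jupyter/Spark error: [Errno 8] nodename nor servname provided, or not known

Problem: when running Spark in Jupyter on macOS, I got the error sc is not defined, which was odd: does this thing really need to be defined? Following other blogs, I installed findspark, which surfaced the error gaierror: [Errno 8] nodename nor servname provided, or not known. That more or less pinned the problem down: the Mac does not know its own hostname. You can try my steps in order and stop at whichever one fixes it; they are mostly collected from solutions found online. I. Install f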

2020-09-24 19:50:55 2498

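[Original] Installing a Spark Environment on Mac Using Mirrors in Mainland China, Including Some Problems Encountered

1. Open a terminal and install Homebrew:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
In mainland China, use this instead:
/bin/zsh -c "$(curl -fsSL https://gitee.com/cunkai/HomebrewCN/raw/master/Homebrew.sh)"
Then follow the installer prompts and wait for it to finish. Once it is installed, you can test it with the following command: brew search red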

2020-09-10 15:40:11 355

[Original] Installing Homebrew on macOS fails with xcode-select: note: no developer tools were found at '/Applications/Xcode.app'

Keywords: installing Homebrew on macOS reports an Xcode-related error. Error: xcode-select: note: no developer tools were found at '/Applications/Xcode.app', requesting install. Choose an option in the dialog to download the command line developer tools. Steps: installing Homebrew on Mac hit a 443 error; the mirror has already been changed. I. Homebr

2020-09-10 14:05:45 14306

HKUST image processing exam questions, Fall 2020

JPEG compression of grayscale images in MATLAB
