1a7d5e389b
very minor admin.html edit
Matt Wells
2013-12-10 00:56:56 -07:00
ec2254d8ed
added multi language support note to admin.html
Matt Wells
2013-12-09 23:18:33 -07:00
f7e7acb398
minor log msg updates. updated admin.html to give some performance and storage capacity info.
Matt Wells
2013-12-09 23:16:24 -07:00
95bd6238d9
do not core when running filters when our gb home dir is really long. thanks bill! call XmlDoc::getSpiderPriority() with a SpiderReply so we can act on m_langId, like chinese, for instance, to filter those langs out from indexing. it was doing this before but got commented out for some reason.
mwells
2013-12-09 22:55:02 -07:00
cc63fd048f
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
mwells
2013-12-09 13:46:08 -08:00
e04d596288
minor comments update.
mwells
2013-12-09 13:42:33 -08:00
2a5d4beec4
fix core from last push.
Matt Wells
2013-12-09 14:21:46 -07:00
fa497de217
remove annoying log msg
Matt Wells
2013-12-09 14:09:48 -07:00
44ae7c4de6
mem labelling fixes. fixed bad alloc when generating gigabits.
Matt Wells
2013-12-09 14:05:02 -07:00
0dcd1211d3
new opensource icon.
Matt Wells
2013-12-08 19:47:39 -07:00
92ec3f1148
added open source icon to homepage
Matt Wells
2013-12-08 19:45:49 -07:00
92e3d841a6
minor update
Matt Wells
2013-12-08 19:28:45 -07:00
12404b4f85
doc updates
Matt Wells
2013-12-08 19:26:48 -07:00
dd3b49faa9
collection name hell
Matt Wells
2013-12-08 16:44:37 -07:00
3353a90a85
fix resuming a killed merge condition.
Matt Wells
2013-12-08 15:50:45 -07:00
ed79b67d2e
core dump fixes
Matt Wells
2013-12-08 15:36:23 -07:00
144e2c898e
save resources by not doing reads on an empty doledb priority. stop saving allSpidersOn and Off parms.
Matt Wells
2013-12-08 14:07:31 -07:00
a2e52a5dc3
little fix
Matt Wells
2013-12-08 10:15:54 -07:00
020d7741b9
new coll.conf for main with ismedia filter. updated url filters docs some more for "isnew" and explained the errorcount stuff more.
Matt Wells
2013-12-08 10:10:51 -07:00
65e75167e3
limit posdb merging to 8 files max. added some more url filters documentation.
Matt Wells
2013-12-08 09:41:05 -07:00
78a4cfe6da
forgot to push the .h files
Matt Wells
2013-12-07 22:12:48 -07:00
e1712fc94f
fix uninitialized diffbot titlerec header parms. ignore them when not a custom crawl.
Matt Wells
2013-12-07 22:11:26 -07:00
06edfddf31
a bunch of bug fixes, mostly spider related. also some for pagereindex.
Matt Wells
2013-12-07 21:56:37 -07:00
5e4b5a112c
Merge branch 'master' into diffbot
Matt Wells
2013-12-07 11:34:26 -07:00
105be1fbdc
more core fixes
Matt Wells
2013-12-07 10:38:47 -07:00
8d92a079c2
minor spider error reply time fix
Matt Wells
2013-12-07 10:21:51 -07:00
e731e5a4d8
Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-12-07 10:21:21 -07:00
0e846a9389
minor spider reply error fix
Matt Wells
2013-12-07 10:21:02 -07:00
626a97770c
another core fix
Matt Wells
2013-12-07 10:14:37 -07:00
fda7b48500
fix core
Matt Wells
2013-12-07 10:11:13 -07:00
1bc80ab552
fixed pagereindex. we now add spiderreplies for internal errors like ENOMEM or ENOTFOUND to try to avoid the "CRITICAL CRITICAL" msgs. these are considered temporary errors.
Matt Wells
2013-12-07 10:01:17 -07:00
d9b31d3481
quick bug fix
Matt Wells
2013-12-06 22:57:49 -07:00
269c10f648
try to figure out why pagereindex never displayed html page when done.
Matt Wells
2013-12-06 22:56:06 -07:00
522e81913f
another parm overhaul checkpoint
mwells
2013-12-06 17:33:55 -08:00
adf9d807ea
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
mwells
2013-12-06 12:31:36 -08:00
08faf78be9
checkpoint for new parm logic to allow syncing with newly added or deleted collections even if a host was dead when collection was added/deleted. also added parm change request queueing.
mwells
2013-12-06 12:29:14 -08:00
e7bd904765
fix docids only printing.
Matt Wells
2013-12-06 09:53:32 -07:00
c50ef1954f
show admin controls on serps if ip is local. fixed up the "reindex" page for deleting/reindexing search results for a given query.
Matt Wells
2013-12-06 09:48:30 -07:00
4b3e111bed
fix spider dumping to remember uh48's between list readings. was showing dups for www.nordicusa.com/webtv at the end.
Matt Wells
2013-12-05 10:09:06 -08:00
99cc10fccd
allow seed urls to match url crawl pattern regardless.
Matt Wells
2013-12-03 17:13:38 -08:00
432099c4e6
added rebuild=true fix for regex crawl change
Matt Wells
2013-12-03 16:23:58 -08:00
2e46bcc97f
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-12-03 16:23:20 -08:00
03219a3057
add regex support back in
Matt Wells
2013-12-03 16:23:05 -08:00
6ab9041f45
fix bug when just getting the crawl parms was rebuilding the waiting tree.
Matt Wells
2013-12-03 16:17:36 -08:00
9f1d79b124
check for null collrec
Matt Wells
2013-12-02 10:13:19 -08:00
cda5968b75
update common word list
Matt Wells
2013-12-01 15:19:33 -07:00
39f8dc646b
default gigabits on for my copy.
Matt Wells
2013-12-01 15:07:06 -07:00
7f4dca7a07
Merge branch 'master' of git@github.com:gigablast/open-source-search-engine
Matt Wells
2013-12-01 14:47:16 -07:00
7874c8d832
added ifdef NEEDSLICENSE
Matt Wells
2013-12-01 14:47:08 -07:00
d43b55103c
show query in msg20 log msg
Matt Wells
2013-12-01 12:11:25 -07:00
1077191e4a
fix log msg bug.
Matt Wells
2013-12-01 12:08:05 -07:00
08030865e4
fix compiler warning
Matt Wells
2013-12-01 11:57:26 -07:00
d811a13627
fix small oopsy
Matt Wells
2013-12-01 11:56:33 -07:00
3155869fbf
added new log msg for recording cpu time for summary generation.
Matt Wells
2013-12-01 11:53:41 -07:00
5ee2be8fcf
fixed data corruption bug. m_finalCrawlDelay was being stored in xmldoc titlerec header.
Matt Wells
2013-11-27 14:18:15 -08:00
1129e9b635
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-27 14:09:54 -08:00
57eb231a4e
do not add timestamps to lastdownload cache if skiphammercheck is true. those are like robots.txt or redirs or root files.
Matt Wells
2013-11-26 14:21:17 -08:00
0f3374e3f3
measure crawl delay by default from start of each download now. it is a parm in msg13request.
Matt Wells
2013-11-26 14:07:28 -08:00
4769ca0881
if pthread_create() returns EAGAIN then do not always retry, it makes an infinite loop.
Matt Wells
2013-11-26 14:52:07 -07:00
8bb086ac60
crawldelay works now but it measures from the end of the download, not the beginning.
Matt Wells
2013-11-26 12:58:14 -08:00
1c7c9a4d80
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-26 09:19:26 -08:00
040bdb8039
fix url filters formulation. fixed extra , in json. fixed upp and ucp patterns if all substrings are negative.
Matt Wells
2013-11-26 09:17:38 -08:00
ca544ddb90
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-25 15:06:11 -08:00
1bbbcff755
fix getTokenizedDiffbotReply() to look for type: with a {} depth of 1 so it does not pick up on the type:image in the images array if there is one in the article.
Matt Wells
2013-11-25 13:58:31 -08:00
61ce4be279
fix major bug when you have twins/mirrors. queries not returning all the results.
Matt Wells
2013-11-25 09:53:53 -07:00
9a456de178
minor fix
Matt Wells
2013-11-24 20:48:47 -07:00
5da41cd113
fix a couple different cores.
Matt Wells
2013-11-24 19:46:44 -07:00
41ce557627
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-22 18:26:53 -08:00
e8065a0f0a
enforce crawl delay perfectly.
Matt Wells
2013-11-22 18:26:34 -08:00
1826860094
forgot to add diffbot api url parm
Matt Wells
2013-11-22 17:55:37 -08:00
f235a20752
add ! support to all patterns
Matt Wells
2013-11-22 17:52:14 -08:00
c3517ee019
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-22 17:37:42 -08:00
bc251e17f5
hosts.conf fix
Matt Wells
2013-11-22 14:18:03 -08:00
791036aabb
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-22 14:17:34 -08:00
3cc300bf03
spider log debug msg fix. boost max cpu threads to 10, seems to have many cores usually.
Matt Wells
2013-11-22 14:17:10 -08:00
e0a15194e1
fix json double decoding issue. no more partial decodes, json parser stores fully decoded string into separate buf.
Matt Wells
2013-11-22 14:16:14 -08:00
6b36ddfd31
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-22 11:14:35 -08:00
9d9a976b4f
fix bug of perpetual round incrementing ad nauseam.
Matt Wells
2013-11-22 11:14:03 -08:00
c8da2a5af7
fix core
Matt Wells
2013-11-22 09:47:12 -07:00
8a58969ab8
try to fix core. log redirects.
Matt Wells
2013-11-22 00:41:33 -08:00
79df39655f
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-21 12:38:03 -08:00
2a5d92a639
log debug update.
Matt Wells
2013-11-21 12:37:53 -08:00
f4de986c7e
test to make sure diffbot reply contains "url":" field. try to find out why some diffbot replies are truncated.
Matt Wells
2013-11-21 12:37:08 -08:00
14e2164acd
oopsy
Matt Wells
2013-11-20 23:40:30 -07:00
acac80d4a9
fix core in summary generation highlighting.
Matt Wells
2013-11-20 23:38:28 -07:00
4a83415832
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-11-20 16:44:41 -08:00
dcae4682e8
new api. tossed action/expression and added urlCrawlPattern/urlProcessPattern/apiUrl
Matt Wells
2013-11-20 16:41:28 -08:00
6f4508c8f1
fix issue of bulk job spidering links because of a simplified redirect.
Matt Wells
2013-11-20 16:09:50 -08:00
43e40208b8
Merge branch 'master' into diffbot
Matt Wells
2013-11-20 15:51:58 -08:00
d2751211fe
do not spider links in XmlDoc::spiderLinks() if it's a custom bulk job. put in logIt() too.
Matt Wells
2013-11-20 15:46:17 -08:00
9489ce6832
now show json items in csv with aligned columns. use search requests as the way to export data now.
Matt Wells
2013-11-20 10:45:10 -08:00
cbc1303a2a
make performance table taller. we are losing graphical data still.
Matt Wells
2013-11-20 10:10:40 -07:00
5baf6a95d4
handle a bunch of oom conditions that caused core. found using oom tester.
mwells
2013-11-20 10:14:02 -07:00
46a683a904
label the bigger safebuf chunks of mem so we can see a better breakdown of mem on the stats page, not just a big "SafeBuf" allocation.
mwells
2013-11-19 23:53:40 -07:00
ff8491c50e
set g_errno in getCollRec()
Matt Wells
2013-11-19 15:49:32 -08:00
b467f70782
fix hosts.conf
Matt Wells
2013-11-19 15:41:03 -08:00
ec4d77f00a
make waiting trees grow dynamically to save space. was taking like 1.5GB of ram for like 100 collections or so.
Matt Wells
2013-11-19 15:23:25 -08:00
c669f8c138
fix file descriptor leak in Dir class. try to fix core from Thread getting SIGALRM. try to set NOFILES to 1024 at startup in case more are allowed.
Matt Wells
2013-11-19 13:41:56 -08:00
35d22bd9aa
fix json parser
Matt Wells
2013-11-19 09:44:42 -08:00